It is useful to think of data as a distribution.
We can obtain summary statistics:
Where does a value of \(X = 5.5\) lie in the distribution?
Suppose we have another sample…
We can take the mean as our orientation.
… then we can say how close/far a value is from the mean
Idea: we locate a point relative to the mean in terms of SDs away.
\(z = \frac{X - \mu}{\sigma}\)
Suppose: \(\mu = 7\) and \(\sigma = 1\)
For our value of 5.5:
\(z = \frac{X - \mu}{\sigma} = \frac{5.5 - 7}{1} = -1.5\)
A value of 5.5 in our data has a z-score of -1.50.
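The calculation above can be sketched in a few lines of Python. The function name `z_score` is purely illustrative; the numbers are the ones used in the example (\(\mu = 7\), \(\sigma = 1\), \(X = 5.5\)).

```python
def z_score(x: float, mu: float, sigma: float) -> float:
    """Return how many standard deviations x lies away from the mean mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma


z = z_score(5.5, mu=7.0, sigma=1.0)
print(f"z = {z:.2f}")  # z = -1.50
```

The same value in a narrower distribution lies further out in SD units: `z_score(5.5, mu=7.0, sigma=0.5)` gives -3.0.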
Red distribution: \(X \sim N(\mu, \sigma)\) –> \(X \sim N(7.00, 1.00)\)
Blue distribution: \(X \sim N(\mu, \sigma)\) –> \(X \sim N(7.00, 0.50)\)
Say we wanted to project these data to a new distribution:
| id | value | z-score | new value |
|---|---|---|---|
| 1 | 6.0 | -1.0 | 90 |
| 2 | 4.5 | -2.5 | 75 |
| 3 | 9.5 | 2.5 | 125 |
| 4 | 7.5 | 0.5 | 105 |
| 5 | 5.5 | -1.5 | 85 |
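A minimal sketch of this projection: convert each value to a z-score in the old distribution, then place that z-score in the new one. The target parameters \(\mu = 100\) and \(\sigma = 10\) are not stated in the text; they are inferred from the table (a z-score of -1.0 maps to 90, one of 0.5 maps to 105). The function name `rescale` is hypothetical.

```python
def rescale(x, mu_old, sigma_old, mu_new, sigma_new):
    """Map x from one distribution to another while keeping its z-score."""
    z = (x - mu_old) / sigma_old      # position in SD units in the old distribution
    return mu_new + z * sigma_new     # same position in the new distribution


values = [6.0, 4.5, 9.5, 7.5, 5.5]
# Old distribution: mu = 7, sigma = 1 (from the text).
# New distribution: mu = 100, sigma = 10 (assumed, inferred from the table).
new_values = [rescale(v, 7.0, 1.0, 100.0, 10.0) for v in values]
print(new_values)  # [90.0, 75.0, 125.0, 105.0, 85.0]
```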
We can infer \(z\), \(\mu\), \(X\) and \(\sigma\) from the z-score formula:
\(z = \frac{X - \mu}{\sigma}\), i.e.
\(X = \mu + z\sigma\), and
\(\mu = X - z\sigma\), and
\(\sigma = \frac{X-\mu}{z}\)
When we “standardise” the distribution, how does it affect the mean \(\mu\) and standard deviation \(\sigma\)?
Take this population with \(\mu=3\) and \(\sigma=\sqrt{2}\approx 1.41\)
| id | value | z |
|---|---|---|
| 1 | 1 | -1.41 |
| 2 | 2 | -0.71 |
| 3 | 3 | 0.00 |
| 4 | 4 | 0.71 |
| 5 | 5 | 1.41 |
This results in:
\(\mu = \frac{-1.41-0.71+0.00+0.71+1.41}{5} = \frac{0}{5} = 0\)
\(\sigma^2 = \frac{SS}{N} = \frac{(-1.41)^2+(-0.71)^2+(0.00)^2+(0.71)^2+(1.41)^2}{5} \approx \frac{5}{5} = 1\)
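The effect of standardising can be checked numerically. The sketch below computes \(\mu\) and \(\sigma\) from the five values themselves, using the population formula (dividing by \(N\)), and then verifies that the z-scores have mean 0 and variance 1. The function name `standardise` is illustrative only.

```python
import math


def standardise(values):
    """Return the z-scores of values, using the population SD (divide by N)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]


z = standardise([1, 2, 3, 4, 5])
z_mean = sum(z) / len(z)
z_var = sum((s - z_mean) ** 2 for s in z) / len(z)  # SS / N of the z-scores

print([round(s, 2) for s in z])
print(round(z_mean, 10), round(z_var, 10))
```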
Suppose …
Simplest form:
Requires random sampling (see page 163)!
We know the guessing probability: 0.50 (or 50%).
\(P(1st\ correct\ and\ 2nd\ correct)=0.50*0.50 = 0.25\)
Thus for 10 correct predictions:
\(P(correct)*P(correct)*P(correct)*...\) –> \(P(correct)^{10}\)
\(0.50^{10} = 0.0009765625\) or 1/1024
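As a quick check of the multiplication rule for independent guesses, the sketch below uses exact fractions so that the 1/1024 shows up directly. The variable names are illustrative.

```python
from fractions import Fraction

p_correct = Fraction(1, 2)   # guessing probability from the text
n_predictions = 10

# Independent events: multiply the single-guess probability n times.
p_all_correct = p_correct ** n_predictions

print(p_all_correct, float(p_all_correct))  # 1/1024 0.0009765625
```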
A great scam!
What is the probability that two people have the same birthday in a class of 10/25/50 students?
We’ll solve this stepwise in the live session
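For orientation ahead of the session, here is one possible sketch. It rests on assumptions not stated above: the question is read as "at least two students share a birthday", there are 365 equally likely birthdays, and birthdays are independent. The idea is to compute the probability that all birthdays differ and subtract it from 1.

```python
def p_shared_birthday(n_students: int, days: int = 365) -> float:
    """P(at least two of n_students share a birthday), assuming equally likely days."""
    p_all_different = 1.0
    for k in range(n_students):
        # The (k+1)-th student must avoid the k birthdays already taken.
        p_all_different *= (days - k) / days
    return 1.0 - p_all_different


for n in (10, 25, 50):
    print(n, round(p_shared_birthday(n), 3))
```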
Maria is 26 years old, single, outspoken, and very bright. She majored in law. As a student, she was deeply concerned with issues of discrimination and miscarriage of justice and participated in weekly animal-rights demonstrations.
Which is more probable?
Formalising the problem:
Why is \(P(B) < P(A)\)?
\(B\) –> \(A\) + does pro bono work for animal-rights activists
Let \(C\) be the event “does pro bono work for animal-rights activists”, with probability \(P(C)\).
Two events occurring together can never be more probable than either event on its own: \(P(A \cap C) \le P(A)\).
So \(P(B) = P(A \cap C) = P(A)*P(C)\) if \(A\) and \(C\) are independent, and since \(P(C) \le 1\), \(P(B) \le P(A)\).
Suppose:
What we are after is: probability of TERRORIST given that there is an ALARM
In probability notation this is expressed as: \(P(T \mid A)\)
| | Alarm | No alarm | Total |
|---|---|---|---|
| Terrorist | 950 | 50 | 1,000 |
| Passenger | 4,950 | 94,050 | 99,000 |
| Total | 5,900 | 94,100 | 100,000 |
\(P(terrorist \mid alarm) = 950/5900 = 16.10\%\)
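The same result can be reached in two ways: directly from the counts in the table, or through Bayes' rule. In the sketch below the three rates (1% terrorists, 95% detection, 5% false alarms) are not quoted from the text; they are inferred from the table counts and should be treated as assumptions.

```python
# Counts taken from the table.
terrorist_alarm = 950
passenger_alarm = 4_950

# Route 1: proportion of terrorists among all alarms.
p_from_counts = terrorist_alarm / (terrorist_alarm + passenger_alarm)

# Route 2: Bayes' rule, with rates inferred from the table (assumptions).
p_t = 1_000 / 100_000             # P(T): share of terrorists
p_a_given_t = 950 / 1_000         # P(A | T): alarm given terrorist
p_a_given_not_t = 4_950 / 99_000  # P(A | not T): false alarm rate

p_alarm = p_a_given_t * p_t + p_a_given_not_t * (1 - p_t)
p_from_bayes = p_a_given_t * p_t / p_alarm

print(f"{p_from_counts:.4f} {p_from_bayes:.4f}")
```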
Defined by two parameters:
Note: a normal distribution is always bell-shaped, but not every bell-shaped distribution is a normal distribution.
We can locate each y-value.
Each x-value corresponds to a density (the height of the curve) through the probability density function (PDF):
\(Y = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(X-\mu)^2}{2\sigma^2}}\)
e.g. for \(X = 3\) in \(N(0,1)\)
\(Y = \frac{1}{\sqrt{2\pi}}e^{-\frac{(3)^2}{2}} = \frac{1}{2.51}e^{-4.5} = \frac{1}{2.51}*0.0111 = 0.0044\)
i.e. the density at \(X=3\) under the standard normal is ~0.0044. This is the height of the curve, not a probability; probabilities correspond to areas under the curve.
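A small sketch of the PDF using only the standard library; the function name `normal_pdf` is illustrative. Evaluated without intermediate rounding, the height of the standard normal curve at \(X = 3\) comes out at about 0.0044.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Height of the normal curve N(mu, sigma) at x (a density, not a probability)."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)


print(round(normal_pdf(3.0), 4))  # 0.0044
print(round(normal_pdf(0.0), 4))  # 0.3989, the peak of the standard normal
```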
We can apply the PDF and obtain the exact shape of the normal distribution.
But we don’t have to do this
There is a nice relationship between the distribution and z-scores.
And we can describe the portions of the function in terms of z-scores.
We can calculate the area covered between two x-values.
We don’t need to because we know how these areas relate to z-scores:
| z | Prop in body | Prop in tail | Prop between M and z |
|---|---|---|---|
| 1.00 | 0.8413 | 0.1587 | 0.3413 |
| 1.96 | 0.9750 | 0.0250 | 0.4750 |
For a standard normal, how likely is it to obtain a value of \(X=0.5\)?
Note: probabilities correspond to areas, so what we really need to ask about is a range of values.
How likely is it to obtain a value of at most 0.5?
| z | Prop in body | Prop in tail | Prop between M and z |
|---|---|---|---|
| 0.50 | 0.6915 | 0.3085 | 0.1915 |
The green area corresponds to the proportion in the body = 69.15%.
A value of at most 0.5 (i.e. 0.5 or lower) has a probability of 69.15%.
How likely is it to obtain a value of at least 0.5?
Note: this means “0.5 or higher”
| z | Prop in body | Prop in tail | Prop between M and z |
|---|---|---|---|
| 0.50 | 0.6915 | 0.3085 | 0.1915 |
The green area corresponds to the proportion in the tail = 30.85%.
A value of at least 0.5 (i.e. 0.5 or higher) has a probability of 30.85%.
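The table columns can also be reproduced in code instead of being looked up. The sketch below uses the error function from the standard library to obtain the cumulative area; the function name `unit_normal_proportions` is hypothetical, and "body" is taken to mean the larger of the two portions, as in the table.

```python
import math


def unit_normal_proportions(z: float):
    """Return (body, tail, between mean and z) for a z-score, as in the table."""
    # Cumulative area up to |z| under the standard normal curve.
    body = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    tail = 1.0 - body
    between = body - 0.5
    return body, tail, between


body, tail, between = unit_normal_proportions(0.5)
print(round(body, 4), round(tail, 4), round(between, 4))  # 0.6915 0.3085 0.1915
```

For \(z = 0.5\), "at most 0.5" is the body and "at least 0.5" is the tail.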
How likely is it to be taller than 1.90m?
We’ll do this in-depth in the live session.
More in the live session
We call these data binomial data.
And the corresponding distribution the binomial distribution.
2 possible outcomes A and B.
Because we have only two outcomes, \(P(A) + P(B) = 1\), so
50/50 chance
Let’s denote \(p\) as a correct guess.
So if I guess once: \(p = 0.50\)
2 guesses: now we have four outcomes
So we can count:
Is this expected?
How (un)likely is that?
What if we did this, say, 1,000 times…
Described formally by two parameters:
\(X \sim B(n, p)\)
Note: when \(n=1\), the binomial distribution is called the Bernoulli distribution.
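To make \(B(n, p)\) concrete, here is a minimal sketch of the binomial probability for exactly \(k\) correct guesses out of \(n\). The formula \(\binom{n}{k}p^k(1-p)^{n-k}\) is the standard one and is not spelled out in the slides; the function name `binomial_pmf` is illustrative.

```python
import math


def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


# Two guesses at p = 0.50: probabilities of 0, 1 and 2 correct guesses.
print([binomial_pmf(k, 2, 0.5) for k in range(3)])  # [0.25, 0.5, 0.25]

# Ten guesses, all correct.
print(binomial_pmf(10, 10, 0.5))  # 0.0009765625
```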
Approaches normal with increasing \(n\). Then, with \(q = 1 - p\):
\(\mu = pn\), and
\(\sigma = \sqrt{npq}\)
We can then also go back to z!
\(z = \frac{X-\mu}{\sigma} = \frac{X-pn}{\sqrt{npq}}\)
We know:
So:
\(\mu = pn = 0.5*10 = 5\)
\(\sigma = \sqrt{npq} = \sqrt{10*0.5*0.5} = \sqrt{2.5} = 1.58\)
\(X=2\)
\(z = \frac{X-\mu}{\sigma} = \frac{X-pn}{\sqrt{npq}} = \frac{2-5}{1.58} = -1.90\)
At least 2: looking at the body prob. in the unit table: 0.9713 (97.13%)
At most 2: looking at the tail prob. in the unit table: 0.0287 (2.87%)
\(X=10\)
\(z = \frac{X-\mu}{\sigma} = \frac{X-pn}{\sqrt{npq}} = \frac{10-5}{1.58} = 3.16\)
Looking at the tail prob. in the unit table: 0.0008 (0.08%)
Rounded, this matches 1/1024 ≈ 0.001!
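The worked example can be retraced in code, with the exact binomial value alongside for comparison. This is a sketch under the same setup as above (\(n = 10\), \(p = 0.5\)) and, like the slides, it uses no continuity correction. The helper names `binomial_z` and `upper_tail` are hypothetical.

```python
import math


def binomial_z(x: float, n: int, p: float) -> float:
    """z-score of x under the normal approximation to B(n, p)."""
    mu = n * p
    sigma = math.sqrt(n * p * (1 - p))
    return (x - mu) / sigma


def upper_tail(z: float) -> float:
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


z2 = binomial_z(2, 10, 0.5)
z10 = binomial_z(10, 10, 0.5)

print(f"X=2:  z = {z2:.2f}, P(at least 2) = {upper_tail(z2):.4f}")
print(f"X=10: z = {z10:.2f}, tail = {upper_tail(z10):.4f}")
print(f"exact P(10 of 10) = {0.5 ** 10:.4f}")
```

The small differences from the table values arise because the table is read at the rounded z-scores (-1.90 and 3.16).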